Back in 2014, fivethirtyeight.com published an article on alcohol consumption in different countries. The dataset drinks is available as part of the fivethirtyeight package. Make sure you have installed the fivethirtyeight package before proceeding.
library(fivethirtyeight)
data(drinks)
# or download directly
# alcohol_direct <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/alcohol-consumption/drinks.csv")

What are the variable types? Any missing values we should worry about?
The variable types are character and numeric (integer and double). There are no missing values.
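Both claims can be verified programmatically; a quick sketch, assuming the fivethirtyeight package is installed:

```r
library(fivethirtyeight)
data(drinks)

# total number of missing values across the whole data frame
sum(is.na(drinks))

# class of each column (character, integer, double)
sapply(drinks, class)
```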
glimpse(drinks)

## Rows: 193
## Columns: 5
## $ country <chr> "Afghanistan", "Albania", "Algeria", "And~
## $ beer_servings <int> 0, 89, 25, 245, 217, 102, 193, 21, 261, 2~
## $ spirit_servings <int> 0, 132, 0, 138, 57, 128, 25, 179, 72, 75,~
## $ wine_servings <int> 0, 54, 14, 312, 45, 45, 221, 11, 212, 191~
## $ total_litres_of_pure_alcohol <dbl> 0.0, 4.9, 0.7, 12.4, 5.9, 4.9, 8.3, 3.8, ~
skimr::skim(drinks)

| Name | drinks |
| Number of rows | 193 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| country | 0 | 1 | 3 | 28 | 0 | 193 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| beer_servings | 0 | 1 | 106.16 | 101.14 | 0 | 20.0 | 76.0 | 188.0 | 376.0 | ▇▃▂▂▁ |
| spirit_servings | 0 | 1 | 80.99 | 88.28 | 0 | 4.0 | 56.0 | 128.0 | 438.0 | ▇▃▂▁▁ |
| wine_servings | 0 | 1 | 49.45 | 79.70 | 0 | 1.0 | 8.0 | 59.0 | 370.0 | ▇▁▁▁▁ |
| total_litres_of_pure_alcohol | 0 | 1 | 4.72 | 3.77 | 0 | 1.3 | 4.2 | 7.2 | 14.4 | ▇▃▅▃▁ |
drinks %>%
slice_max(order_by = beer_servings, n=25) %>%
ggplot(aes(x = beer_servings, y = reorder(country,beer_servings))) +
geom_col(fill = "chocolate")+
theme_bw()+
labs(
title = "Top 25 beer consuming countries",
subtitle = "Bar Chart",
x = "Beer Servings",
y = NULL
)

Comment on the top 25 beer consuming countries:
The African countries Namibia and Gabon exhibit especially high beer consumption given their colonial history: Namibia was a German colony until the early 20th century, and Gabon a French one. Both countries have relatively small populations (around 2 million), and their life expectancies are relatively low (around 65 years), which means that older people, who may consume less beer, make up a smaller share of the population.
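The bar ordering above uses reorder() inside aes(); forcats::fct_reorder(), used for the later charts, behaves the same way here. A toy sketch with made-up country names:

```r
library(forcats)

country  <- c("A", "B", "C")
servings <- c(10, 30, 20)

# factor levels sorted by servings, lowest first,
# so the largest bar ends up at the top of a horizontal bar chart
levels(fct_reorder(country, servings))   # "A" "C" "B"
```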
drinks %>%
slice_max(order_by = wine_servings, n=25) %>%
ggplot(aes(x = wine_servings, y = fct_reorder(country,wine_servings))) +
geom_col(fill = "dark red")+
theme_bw()+
labs(
title = "Top 25 wine consuming countries",
subtitle = "Bar Chart",
x = "Wine Servings",
y = NULL
)

Comment on the top 25 wine consuming countries:
Andorra borders France, the highest wine consuming country, and has a population of only 77,000. Equatorial Guinea spent most of its colonial history under Spanish (and, earlier, Portuguese) rule, which may also help explain its high wine consumption.
drinks %>%
slice_max(order_by = spirit_servings, n=25) %>%
ggplot(aes(x = spirit_servings, y = fct_reorder(country,spirit_servings))) +
geom_col(fill = "White", colour = "black")+
theme_bw()+
theme(panel.grid.major.y = element_blank(), panel.grid.minor.y = element_blank()) +
labs(
title = "Top 25 spirit consuming countries",
subtitle = "Bar Chart",
x = "Spirit servings",
y = NULL
)

Comment on the top 25 spirit consuming countries:
Grenada passed a Liquor Licence Act in 1901 which allowed citizens to consume unlimited amounts of alcohol legally, and this has had a profound impact on the country's drinking culture; many locals argue that more regulation is needed to improve the situation. Many of the countries in the top 25 are in the Caribbean, which has a strong spirits culture centred around its rum manufacturing capabilities.
What can you infer from these plots? Don’t just explain what’s in the graph, but speculate or tell a short story (1-2 paragraphs max).
TYPE YOUR ANSWER AFTER (AND OUTSIDE!) THIS BLOCKQUOTE.
What we can infer from these plots is that alcohol consumption can often be traced to cultural and historical factors. For example, Namibia used to be a German colony and hence consumes the most beer; France has a centuries-old wine culture; and Grenada, the top spirit consumer, is in the Caribbean, which has a long history of spirit manufacturing. What is interesting is that very few countries appear in the top 25 for more than one of beer, wine, or spirits: a country that ranks highly in one type of alcohol tends to consume comparatively little of the others.
We will look at a subset of movies, taken from the Kaggle IMDB 5000 movie dataset
movies <- read_csv(here::here("data", "movies.csv"))
glimpse(movies)

## Rows: 2,961
## Columns: 11
## $ title <chr> "Avatar", "Titanic", "Jurassic World", "The Avenge~
## $ genre <chr> "Action", "Drama", "Action", "Action", "Action", "~
## $ director <chr> "James Cameron", "James Cameron", "Colin Trevorrow~
## $ year <dbl> 2009, 1997, 2015, 2012, 2008, 1999, 1977, 2015, 20~
## $ duration <dbl> 178, 194, 124, 173, 152, 136, 125, 141, 164, 93, 1~
## $ gross <dbl> 7.61e+08, 6.59e+08, 6.52e+08, 6.23e+08, 5.33e+08, ~
## $ budget <dbl> 2.37e+08, 2.00e+08, 1.50e+08, 2.20e+08, 1.85e+08, ~
## $ cast_facebook_likes <dbl> 4834, 45223, 8458, 87697, 57802, 37723, 13485, 920~
## $ votes <dbl> 886204, 793059, 418214, 995415, 1676169, 534658, 9~
## $ reviews <dbl> 3777, 2843, 1934, 2425, 5312, 3917, 1752, 1752, 35~
## $ rating <dbl> 7.9, 7.7, 7.0, 8.1, 9.0, 6.5, 8.7, 7.5, 8.5, 7.2, ~
skimr::skim(movies)

| Name | movies |
| Number of rows | 2961 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 8 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| title | 0 | 1 | 1 | 83 | 0 | 2907 | 0 |
| genre | 0 | 1 | 5 | 11 | 0 | 17 | 0 |
| director | 0 | 1 | 3 | 32 | 0 | 1366 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1 | 2.00e+03 | 9.95e+00 | 1920.0 | 2.00e+03 | 2.00e+03 | 2.01e+03 | 2.02e+03 | ▁▁▁▂▇ |
| duration | 0 | 1 | 1.10e+02 | 2.22e+01 | 37.0 | 9.50e+01 | 1.06e+02 | 1.19e+02 | 3.30e+02 | ▃▇▁▁▁ |
| gross | 0 | 1 | 5.81e+07 | 7.25e+07 | 703.0 | 1.23e+07 | 3.47e+07 | 7.56e+07 | 7.61e+08 | ▇▁▁▁▁ |
| budget | 0 | 1 | 4.06e+07 | 4.37e+07 | 218.0 | 1.10e+07 | 2.60e+07 | 5.50e+07 | 3.00e+08 | ▇▂▁▁▁ |
| cast_facebook_likes | 0 | 1 | 1.24e+04 | 2.05e+04 | 0.0 | 2.24e+03 | 4.60e+03 | 1.69e+04 | 6.57e+05 | ▇▁▁▁▁ |
| votes | 0 | 1 | 1.09e+05 | 1.58e+05 | 5.0 | 1.99e+04 | 5.57e+04 | 1.33e+05 | 1.69e+06 | ▇▁▁▁▁ |
| reviews | 0 | 1 | 5.03e+02 | 4.94e+02 | 2.0 | 1.99e+02 | 3.64e+02 | 6.31e+02 | 5.31e+03 | ▇▁▁▁▁ |
| rating | 0 | 1 | 6.39e+00 | 1.05e+00 | 1.6 | 5.80e+00 | 6.50e+00 | 7.10e+00 | 9.30e+00 | ▁▁▆▇▁ |
Besides the obvious variables of title, genre, director, year, and duration, the rest of the variables are as follows:
- gross: the gross earnings in the US box office, not adjusted for inflation
- budget: the movie's budget
- cast_facebook_likes: the number of Facebook likes cast members received
- votes: the number of people who voted for (or rated) the movie in IMDB
- reviews: the number of reviews for that movie
- rating: IMDB average rating

Are there any missing values (NAs)? Are all entries distinct or are there duplicate entries?
There are no missing values, but there are some duplicate entries. That is why we create the data frame unique_movies below, which removes duplicates before the rest of the analysis.
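One way to count and inspect the duplicates; a sketch on a toy stand-in for the movies data, assuming dplyr (the real check would use title, genre, director and year as the identifying columns):

```r
library(dplyr)

# toy data with one duplicated entry, standing in for movies.csv
toy_movies <- tibble(
  title = c("Avatar", "Avatar", "Titanic"),
  year  = c(2009, 2009, 1997)
)

# number of duplicate rows = rows lost by distinct()
nrow(toy_movies) - nrow(distinct(toy_movies, title, year))   # 1

# or list the duplicated entries directly
toy_movies %>%
  count(title, year) %>%
  filter(n > 1)
```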
Produce a table with the count of movies by genre, ranked in descending order
unique_movies <- movies %>%
distinct(title,genre,director,year,.keep_all = TRUE)
movies_count_genre<-unique_movies %>%
group_by(genre) %>%
count(sort=TRUE)
head(movies_count_genre)

## # A tibble: 6 x 2
## # Groups: genre [6]
## genre n
## <chr> <int>
## 1 Comedy 844
## 2 Action 719
## 3 Drama 484
## 4 Adventure 281
## 5 Crime 198
## 6 Biography 135

Comment:
Genres that cater to a wider crowd tend to have more demand and therefore more are produced (i.e. comedy and action films). Other genres such as crime, biography and drama tend to require more specific interests which cater to a smaller crowd and thus garner less demand.
Produce a table with the average gross earnings and budget (gross and budget) by genre. Calculate a variable return_on_budget which shows how many $ a movie made at the box office for each $ of its budget. Rank genres by this return_on_budget in descending order
returns_movies <- unique_movies %>%
group_by(genre)%>%
summarise(mean_gross = mean(gross),
mean_budget = mean(budget)
) %>%
mutate(return_on_budget = (mean_gross-mean_budget)/mean_budget) %>%
arrange(-return_on_budget)
head(returns_movies)

## # A tibble: 6 x 4
## genre mean_gross mean_budget return_on_budget
## <chr> <dbl> <dbl> <dbl>
## 1 Musical 92084000 3189500 27.9
## 2 Family 149160478. 14833333. 9.06
## 3 Western 20821884 3465000 5.01
## 4 Documentary 17353973. 5887852. 1.95
## 5 Horror 37782310. 13804379. 1.74
## 6 Fantasy 41902674. 18484615. 1.27

Comment:
Musicals have by far the highest return on budget. Their mean budget tends to be very low compared to genres such as horror and fantasy, whose higher budgets reflect the cost of special effects on top of live actors. Musicals also share many elements with family films, which lets them draw in a wider audience.
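Note the question's wording ("how many $ did a movie make for each $ of its budget") could also be read as the plain gross/budget ratio rather than the net return computed above; the two differ by exactly 1. A toy sketch with made-up numbers, assuming dplyr:

```r
library(dplyr)

toy <- tibble(genre       = c("Musical", "Horror"),
              mean_gross  = c(90e6, 36e6),
              mean_budget = c(3e6, 12e6))

toy %>%
  mutate(gross_per_dollar = mean_gross / mean_budget,   # $ earned per $ spent
         return_on_budget = gross_per_dollar - 1)       # net version, as above
```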
Produce a table that shows the top 15 directors who have created the highest gross revenue in the box office. Don’t just show the total gross amount, but also the mean, median, and standard deviation per director.
top_directors<- unique_movies %>%
group_by(director) %>%
summarise(
sum_gross=sum(gross),
mean_gross = mean(gross),
median_gross = median(gross),
sd_gross = sd(gross)) %>%
arrange(desc(sum_gross))
head(top_directors, 15)

## # A tibble: 15 x 5
## director sum_gross mean_gross median_gross sd_gross
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Steven Spielberg 4014061704 174524422. 164435221 101421051.
## 2 Michael Bay 2195443511 182953626. 168468240. 125789167.
## 3 James Cameron 1909725910 318287652. 175562880. 309171337.
## 4 Christopher Nolan 1813227576 226653447 196667606. 187224133.
## 5 George Lucas 1741418480 348283696 380262555 146193880.
## 6 Robert Zemeckis 1619309108 124562239. 100853835 91300279.
## 7 Tim Burton 1557078534 111219895. 69791834 99304293.
## 8 Sam Raimi 1443167519 180395940. 138480208 174705230.
## 9 Clint Eastwood 1378321100 72543216. 46700000 75487408.
## 10 Francis Lawrence 1358501971 271700394. 281666058 135437020.
## 11 Ron Howard 1335988092 111332341 101587923 81933761.
## 12 Gore Verbinski 1329600995 189942999. 123207194 154473822.
## 13 Andrew Adamson 1137446920 284361730 279680930. 120895765.
## 14 Shawn Levy 1129750988 102704635. 85463309 65484773.
## 15 Ridley Scott 1128857598 80632686. 47775715 68812285.
Comment:
Steven Spielberg has a much higher total gross revenue than the other directors. This can be attributed to the fact that he directed many films which are part of bigger franchises, e.g. Jurassic Park, Indiana Jones and Jaws. James Cameron has a disproportionately high standard deviation, as two of his films (Titanic and Avatar) grossed much more than his other films.
Finally, ratings. Produce a table that describes how ratings are distributed by genre. We don’t want just the mean, but also min, max, median, SD and some kind of a histogram or density graph that visually shows how ratings are distributed.
rating_distribution <- unique_movies %>%
group_by(genre) %>%
summarise(
min_rating=min(rating),
max_rating = max(rating),
mean_rating = mean(rating),
median_rating = median(rating),
sd_rating = sd(rating))
head(rating_distribution)

## # A tibble: 6 x 6
## genre min_rating max_rating mean_rating median_rating sd_rating
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Action 2.1 9 6.23 6.3 1.04
## 2 Adventure 2.3 8.6 6.51 6.6 1.11
## 3 Animation 4.5 8 6.65 6.9 0.968
## 4 Biography 4.5 8.9 7.11 7.2 0.760
## 5 Comedy 1.9 8.8 6.11 6.2 1.02
## 6 Crime 4.8 9.3 6.92 6.9 0.853
Rating_histogram <- ggplot(unique_movies, aes(x = rating)) +
geom_histogram()+
theme_bw()+
labs(title = "Rating Distribution",
subtitle = "Bar Chart",
x = "Rating",
y = "Count"
)
Rating_histogram

Comment:
The count of film ratings appears roughly bell-shaped but is skewed to the left, as most movies fall within the 5-7.5 rating range. One reason for this skewness is that filmmakers are unlikely to produce films they expect to rate much below 5.
Use ggplot to answer the following:

gross and cast_facebook_likes. Produce a scatter plot and write one sentence discussing whether the number of facebook likes that the cast has received is likely to be a good predictor of how much money a movie will make at the box office. What variable are you going to map to the Y- and X- axes?

ggplot(unique_movies, aes(y = gross, x = cast_facebook_likes)) +
geom_point()+
geom_smooth()+
theme_bw()+
labs(
title = "Relationship between Gross Earnings and Facebook Likes",
subtitle = "Scatter Plot",
y = "Gross Earnings (US$)",
x = "Facebook Likes"
)+
scale_y_log10()+
scale_x_log10()+
NULL

Comment:
There is no strong positive relationship between gross earnings and Facebook likes: more likes do not necessarily mean higher revenues, as likes do not automatically translate into people spending money to watch the film. Nevertheless, the highest grossing films do tend to have quite a few Facebook likes, as they are very popular. Facebook likes go on the x-axis and gross earnings on the y-axis, because we are testing whether Facebook likes predict gross earnings: the candidate predictor belongs on the x-axis and the outcome on the y-axis.
gross and budget. Produce a scatter plot and write one sentence discussing whether budget is likely to be a good predictor of how much money a movie will make at the box office.

ggplot(unique_movies, aes(y = gross, x = budget)) +
geom_point()+
geom_smooth()+
theme_bw()+
labs(
title = "Relationship between Gross Earnings and Budget",
subtitle = "Scatter Plot",
y = "Gross Earnings (US$)",
x = "Budget"
)+
scale_y_log10()+
scale_x_log10()+
NULL

Comment:
There is not a strong positive correlation between gross earnings and budget until the budget surpasses about $1 million; beyond that point the correlation strengthens. The sample of films with budgets under $1 million is also very small. Quite a few high-budget films flop at the box office, which is why the overall correlation is not stronger.
gross and rating. Produce a scatterplot, faceted by genre and discuss whether IMDB ratings are likely to be a good predictor of how much money a movie will make at the box office. Is there anything strange in this dataset?

ggplot(unique_movies, aes(x = rating, y = gross)) +
geom_point()+
facet_wrap(~genre)+
geom_smooth()+
theme_bw()+
labs(
title = "Relationship between Gross Earnings and Rating for each Genre",
subtitle = "Faceted Scatter Plot",
x = "Rating",
y = "Gross Earnings (US$)"
)+
#scale_y_log10()+
#scale_x_log10()
NULL

Comment:
There is less correlation between gross earnings and ratings for the comedy and horror genres, because people often watch these films to have a laugh or be scared, regardless of the quality of the film or the plot. On the other hand, the earnings of action and adventure films track their ratings more closely, as people watch these movies based on their perceived quality. Biography films exhibit an interesting dynamic: there is a positive correlation between earnings and ratings up to a point, beyond which the highest rated biography films earn less than expected. Similar patterns appear in mystery and sci-fi films, whose earnings peak in the middle of the ratings distribution. However, these trends may simply reflect the small number of films in those genres.
nyse.csv contains 508 stocks listed on the NYSE, their ticker symbol, name, the IPO (Initial Public Offering) year, and the sector and industry the company is in.

library(tidyquant)
nyse <- read_csv(here::here("data","nyse.csv"))
glimpse(nyse)

## Rows: 508
## Columns: 6
## $ symbol <chr> "MMM", "ABB", "ABT", "ABBV", "ACN", "AAP", "AFL", "A", "~
## $ name <chr> "3M Company", "ABB Ltd", "Abbott Laboratories", "AbbVie ~
## $ ipo_year <chr> "n/a", "n/a", "n/a", "2012", "2001", "n/a", "n/a", "1999~
## $ sector <chr> "Health Care", "Consumer Durables", "Health Care", "Heal~
## $ industry <chr> "Medical/Dental Instruments", "Electrical Products", "Ma~
## $ summary_quote <chr> "https://www.nasdaq.com/symbol/mmm", "https://www.nasdaq~
skimr::skim(nyse)

| Name | nyse |
| Number of rows | 508 |
| Number of columns | 6 |
| _______________________ | |
| Column type frequency: | |
| character | 6 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| symbol | 0 | 1 | 1 | 4 | 0 | 508 | 0 |
| name | 0 | 1 | 5 | 48 | 0 | 505 | 0 |
| ipo_year | 0 | 1 | 3 | 4 | 0 | 33 | 0 |
| sector | 0 | 1 | 6 | 21 | 0 | 12 | 0 |
| industry | 0 | 1 | 5 | 62 | 0 | 103 | 0 |
| summary_quote | 0 | 1 | 31 | 34 | 0 | 508 | 0 |
unique_companies <- nyse %>%
distinct(symbol, name,ipo_year,.keep_all = TRUE)
#divide by sector
companies_per_sector <- unique_companies %>%
group_by(sector) %>%
count(sort=TRUE)
companies_per_sector

## # A tibble: 12 x 2
## # Groups: sector [12]
## sector n
## <chr> <int>
## 1 Finance 97
## 2 Consumer Services 79
## 3 Public Utilities 60
## 4 Capital Goods 45
## 5 Health Care 45
## 6 Energy 42
## 7 Technology 40
## 8 Basic Industries 39
## 9 Consumer Non-Durables 31
## 10 Miscellaneous 12
## 11 Transportation 10
## 12 Consumer Durables 8
companies_per_sector_plot <- unique_companies %>%
group_by(sector) %>%
count(sort=TRUE) %>%
ggplot(aes( x = n, y = reorder(sector,n)))+
geom_col()+
theme_bw()+
labs(title = "Number of companies per Sector",
subtitle = "Bar Chart",
x = "Number of Companies",
y = NULL)
companies_per_sector_plot

Comment:
The NYSE lists many companies in traditional industries (e.g. banks and insurance companies), which is why Finance and Consumer Services account for the largest numbers of companies in the dataset. Many technology companies instead choose to list on the NASDAQ, which explains why there are proportionately fewer tech companies in this dataset.
SPY which is an SP500 ETF (Exchange Traded Fund). The following code

djia_url <- "https://en.wikipedia.org/wiki/Dow_Jones_Industrial_Average"
#get tables that exist on URL
tables <- djia_url %>%
read_html() %>%
html_nodes(css="table")
# parse HTML tables into a dataframe called djia.
# Use purr::map() to create a list of all tables in URL
djia <- map(tables, . %>%
html_table(fill=TRUE)%>%
clean_names())
# constituents
table1 <- djia[[2]] %>% # the second table on the page contains the ticker symbols
mutate(date_added = ymd(date_added),
# if a stock is listed on NYSE, its symbol is, e.g., NYSE: MMM
# We will get prices from yahoo finance which requires just the ticker
# if symbol contains "NYSE*", the * being a wildcard
# then we just drop the first 6 characters in that string
ticker = ifelse(str_detect(symbol, "NYSE*"),
str_sub(symbol,7,11),
symbol)
)
# we need a vector of strings with just the 30 tickers + SPY
tickers <- table1 %>%
select(ticker) %>%
pull() %>% # pull() gets them as a string of characters
c("SPY") # and lets us add SPY, the SP500 ETF

# Notice the cache=TRUE argument in the chunk options. Because getting data is time consuming,
# cache=TRUE means that once it downloads data, the chunk will not run again next time you knit your Rmd
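One caveat: as a regex, "NYSE*" means "NYS followed by zero or more Es" (so it actually matches on "NYS"), and str_sub(symbol, 7, 11) assumes a fixed-width "NYSE: " prefix. A slightly more robust sketch, with hypothetical symbols, assuming stringr:

```r
library(stringr)

symbols <- c("NYSE: MMM", "AAPL", "NYSE: DIS")

# strip an optional "NYSE: " prefix instead of relying on character positions
tickers <- str_remove(symbols, "^NYSE:\\s*")
tickers   # "MMM" "AAPL" "DIS"
```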
myStocks <- tickers %>%
tq_get(get = "stock.prices",
from = "2000-01-01",
to = Sys.Date()) %>% # Sys.Date() returns today's price
group_by(symbol)
glimpse(myStocks) # examine the structure of the resulting data frame

## Rows: 161,491
## Columns: 8
## Groups: symbol [31]
## $ symbol <chr> "MMM", "MMM", "MMM", "MMM", "MMM", "MMM", "MMM", "MMM", "MMM"~
## $ date <date> 2000-01-03, 2000-01-04, 2000-01-05, 2000-01-06, 2000-01-07, ~
## $ open <dbl> 48.0, 46.4, 45.6, 47.2, 50.6, 50.2, 50.4, 51.0, 50.7, 50.4, 4~
## $ high <dbl> 48.2, 47.4, 48.1, 51.2, 51.9, 51.8, 51.2, 51.8, 50.9, 50.5, 4~
## $ low <dbl> 47.0, 45.3, 45.6, 47.2, 50.0, 50.0, 50.2, 50.4, 50.2, 49.5, 4~
## $ close <dbl> 47.2, 45.3, 46.6, 50.4, 51.4, 51.1, 50.2, 50.4, 50.4, 49.7, 4~
## $ volume <dbl> 2173400, 2713800, 3699400, 5975800, 4101200, 3863800, 2357600~
## $ adjusted <dbl> 27.2, 26.1, 26.9, 29.0, 29.6, 29.4, 28.9, 29.0, 29.0, 28.6, 2~
#calculate daily returns
myStocks_returns_daily <- myStocks %>%
tq_transmute(select = adjusted,
mutate_fun = periodReturn,
period = "daily",
type = "log",
               col_rename = "daily_returns")
#calculate monthly returns
myStocks_returns_monthly <- myStocks %>%
tq_transmute(select = adjusted,
mutate_fun = periodReturn,
period = "monthly",
type = "arithmetic",
               col_rename = "monthly_returns")
#calculate yearly returns
myStocks_returns_annual <- myStocks %>%
group_by(symbol) %>%
tq_transmute(select = adjusted,
mutate_fun = periodReturn,
period = "yearly",
type = "arithmetic",
               col_rename = "yearly_returns")

Summarise monthly returns for each of the stocks and SPY; min, max, median, mean, SD.

table_monthly_returns <- myStocks_returns_monthly %>%
group_by(symbol) %>%
summarise(
mean_monthly_returns = mean(monthly_returns),
min_ret=min(monthly_returns),
max_ret = max(monthly_returns),
mean_ret = mean(monthly_returns),
median_ret = median(monthly_returns),
sd_ret = sd(monthly_returns))
table_monthly_returns

## # A tibble: 31 x 7
## symbol mean_monthly_returns min_ret max_ret mean_ret median_ret sd_ret
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AAPL 0.0269 -0.577 0.454 0.0269 0.0338 0.115
## 2 AMGN 0.00826 -0.170 0.328 0.00826 0.00936 0.0733
## 3 AXP 0.0101 -0.279 0.875 0.0101 0.0112 0.0922
## 4 BA 0.0127 -0.458 0.459 0.0127 0.0167 0.0926
## 5 CAT 0.0143 -0.353 0.350 0.0143 0.0133 0.0899
## 6 CRM 0.0263 -0.360 0.403 0.0263 0.0199 0.111
## 7 CSCO 0.00611 -0.367 0.389 0.00611 0.0117 0.0963
## 8 CVX 0.00852 -0.224 0.273 0.00852 0.0101 0.0653
## 9 DIS 0.0105 -0.268 0.234 0.0105 0.00995 0.0737
## 10 DOW 0.0149 -0.276 0.255 0.0149 0.0367 0.111
## # ... with 21 more rows
Comment:
The Dow Jones Industrial Average is a stock price-weighted index composed of 30 of the most important publicly-traded companies in the United States. Because the weighting reflects share price rather than company size, a company with a smaller weighting can have a much larger market capitalization: Honeywell, with a 4.29% weighting, has a market capitalization of $152 billion, while Apple, with only a 2.78% weighting, has a total market capitalization of $2.43 trillion.
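The daily returns above use type = "log" while the monthly and yearly returns use "arithmetic"; the two are linked by log return = log(1 + arithmetic return). A self-contained sketch with made-up prices, no tidyquant required:

```r
prices <- c(100, 105, 99.75)

arith <- prices[-1] / head(prices, -1) - 1   # simple (arithmetic) returns
logr  <- diff(log(prices))                   # log returns

# the identity holds term by term
all.equal(logr, log(1 + arith))   # TRUE
```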
Plot a density plot, using geom_density(), for each of the stocks

density_monthly_returns <-
ggplot(myStocks_returns_monthly, aes(x = monthly_returns)) +
geom_density()+
facet_wrap(~symbol)+
theme_bw()+
labs(title = "Monthly returns for each stock",
subtitle="Density Plot",
x = "Monthly Returns",
y = "Density") +
NULL
density_monthly_returns

What can you infer from this plot? Which stock is the riskiest? The least risky?
TYPE YOUR ANSWER AFTER (AND OUTSIDE!) THIS BLOCKQUOTE.
Stocks such as AAPL, DOW and INTC have the largest standard deviations in the previous table. Given the cyclical nature of their businesses, they carry higher levels of risk: their monthly returns are more dispersed and their distributions have fatter tails. Non-cyclical stocks such as PG, KO and JNJ tend to underperform cyclical stocks in a bull market, but their return distributions are narrower, with fewer extreme months, so these companies can be perceived as safer.
Use ggrepel::geom_text_repel() to label each stock

library(ggrepel)
risk_return_plot <-
ggplot(table_monthly_returns, aes(x = sd_ret , y = mean_monthly_returns))+
geom_point()+
geom_smooth()+
theme_bw()+
labs(title = "Relationship between monthly returns and standard deviation",
subtitle="Scatter Plot",
x = "Standard Deviation",
y = "Monthly Returns") +
geom_text_repel(data = table_monthly_returns, aes(x = sd_ret , y = mean_monthly_returns, label = symbol))+
NULL
risk_return_plot

What can you infer from this plot? Are there any stocks which, while being riskier, do not have a higher expected return?
TYPE YOUR ANSWER AFTER (AND OUTSIDE!) THIS BLOCKQUOTE.
The graph plots each stock's mean monthly return against its standard deviation. The stock with the lowest standard deviation, SPY (the S&P 500 ETF), displays returns similar to stocks with much higher risk; this is consistent with the idea that unsystematic, diversifiable risk is not rewarded with higher returns. Overall, the relationship between return and risk is only very weakly positive. Excluding outliers such as AAPL and CRM, most stocks' returns do not increase with risk: JPM and CSCO, for example, carry much more risk than MCD or PG, yet display similar returns.
The surge in inflation seen across major economies is probably short lived because it’s confined to just a few sectors of the economy, according to the Bank for International Settlements.
New research by the BIS’s Claudio Borio, Piti Disyatat, Egon Zakrajsek and Dora Xia adds to one of the hottest debates in economics – how long the current surge in consumer prices will last. Both Federal Reserve Chair Jerome Powell and his euro-area counterpart Christine Lagarde have said the pickup is probably transitory, despite a snarled global supply chain and a spike in energy prices.
You have to download data for CPI and the 10 year bill and produce the following graph
The relevant indicators from FRED are CPIAUCSL (the Consumer Price Index for All Urban Consumers) and GS10 (the 10-Year Treasury Constant Maturity Rate):
cpi<- tq_get("CPIAUCSL", get = "economic.data",
from = "1980-01-01") %>%
rename(cpi = symbol, # FRED data is given as 'symbol' and 'price'
rate = price) %>% # we rename them to what they really are, e.g., cpi and rate
# calculate yearly change in CPI by dividing current month by same month a year (or 12 months) earlier, minus 1
mutate(cpi_yoy_change = rate/lag(rate, 12) - 1)
ten_year_monthly <- tq_get("GS10", get = "economic.data",
from = "1980-01-01") %>%
rename(ten_year = symbol,
yield = price) %>%
mutate(yield = yield / 100) # original data is not given as, e.g., 0.05, but rather 5, for five percent
# we have the two dataframes-- we now need to join them, and we will use left_join()
# base R has a function merge() that does the same, but it's slow, so please don't use it
mydata <-
cpi %>%
left_join(ten_year_monthly, by="date") %>%
mutate(
year = year(date), # using lubridate::year() to generate a new column with just the year
month = month(date, label = TRUE),
decade=case_when(
year %in% 1980:1989 ~ "1980s",
year %in% 1990:1999 ~ "1990s",
year %in% 2000:2009 ~ "2000s",
year %in% 2010:2019 ~ "2010s",
TRUE ~ "2020s"
)
)
mydata

## # A tibble: 500 x 9
## cpi date rate cpi_yoy_change ten_year yield year month decade
## <chr> <date> <dbl> <dbl> <chr> <dbl> <dbl> <ord> <chr>
## 1 CPIAUCSL 1980-01-01 78 NA GS10 0.108 1980 Jan 1980s
## 2 CPIAUCSL 1980-02-01 79 NA GS10 0.124 1980 Feb 1980s
## 3 CPIAUCSL 1980-03-01 80.1 NA GS10 0.128 1980 Mar 1980s
## 4 CPIAUCSL 1980-04-01 80.9 NA GS10 0.115 1980 Apr 1980s
## 5 CPIAUCSL 1980-05-01 81.7 NA GS10 0.102 1980 May 1980s
## 6 CPIAUCSL 1980-06-01 82.5 NA GS10 0.0978 1980 Jun 1980s
## 7 CPIAUCSL 1980-07-01 82.6 NA GS10 0.102 1980 Jul 1980s
## 8 CPIAUCSL 1980-08-01 83.2 NA GS10 0.111 1980 Aug 1980s
## 9 CPIAUCSL 1980-09-01 83.9 NA GS10 0.115 1980 Sep 1980s
## 10 CPIAUCSL 1980-10-01 84.7 NA GS10 0.118 1980 Oct 1980s
## # ... with 490 more rows
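A toy check of the cpi_yoy_change formula, using a made-up index that rises exactly 2% every 12 months, assuming dplyr:

```r
library(dplyr)

# 24 monthly observations: 100 for year one, 102 for year two
toy <- tibble(rate = c(rep(100, 12), rep(102, 12)))

toy <- toy %>%
  mutate(yoy = rate / lag(rate, 12) - 1)

# the first 12 months have no year-earlier value (NA); the rest show 2%
unique(toy$yoy[13:24])   # 0.02
```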
library(ggrepel)
my_plot <- ggplot(mydata, aes(x= cpi_yoy_change, y= yield, color= decade, label = date)) +
geom_point() +
geom_smooth(method=lm) +
theme_bw() +
theme(legend.position= "none") +
facet_wrap(~decade, nrow= 5, scales ="free") +
labs(title = "How are CPI and 10-year yield related?",
x = "CPI Yearly Change",
y = "10-year Treasury Constant Maturity Rate") +
theme(axis.title.x = element_text(face = "bold")) +
theme(axis.title.y = element_text(face = "bold")) +
scale_y_continuous(labels=scales::percent) +
scale_x_continuous(labels=scales::percent) +
geom_text_repel(aes(x= cpi_yoy_change, y= yield,label = paste(mydata$month,mydata$year))) +
NULL
my_plot

Comment:
There is a strong positive correlation between inflation and treasury yields, but this breaks down in the 2000s around the 2008 global financial crisis, when treasury yields stayed relatively high while inflation turned negative. The crisis sent consumer sentiment sharply lower, yet investors were still eager to buy 10-year treasury bonds as a safe-haven asset, reflecting long-term faith in the economy. The positive correlation is especially visible in 2021, as both yields and inflation have recovered from the lows of 2020. We can also observe a general downward trend in yields over time: after the 2008 financial crisis we entered a period of quantitative easing, which greatly reduced overall interest rates. For the 2020s panel, the range of the y-variable (the 10-year Treasury Constant Maturity Rate) is compressed and the x-variable (inflation) is positive. Bear in mind that the 2020s panel covers only 2020 and 2021, while the other decades have a full ten years of data.
Data Source:
library(scales)
library(ggpubr)
trump_vote<-election2020_results %>%
filter(mode=="TOTAL"|mode=="ELECTION DAY") %>%
select(fips, candidate, candidatevotes,totalvotes) %>%
filter(candidate=="DONALD J TRUMP") %>%
mutate(trump_vote=candidatevotes/totalvotes) %>%
group_by(fips) %>%
summarise(votes=max(trump_vote))
fully_vaccinated<-vaccinations %>%
mutate(month = substring(date, 1, 2)) %>%
filter(month== '08' ,series_complete_pop_pct!=0) %>%
group_by(fips) %>%
summarise(max_complete=max(series_complete_yes))
table_new<-population %>%
left_join(trump_vote, by="fips") %>%
left_join(fully_vaccinated, by="fips") %>%
mutate(pct=max_complete/pop_estimate_2019)
table_new <-na.omit(table_new)
table_new[table_new$votes>1,3]<-1
table_new[table_new$pct>1,5]<-1
vaccination_plot <- ggplot(table_new, aes(x = votes, y = pct, size = pop_estimate_2019)) +
  geom_point(color = "deepskyblue", alpha = 0.5) +
  geom_smooth(method = lm, se = FALSE) +
  geom_hline(yintercept = 0.508, linetype = "dashed") +
  geom_text(aes(0, 0.508, label = "ACTUAL 50.8%", hjust = 0.2, vjust = -0.5), size = 2.5) +
  geom_hline(yintercept = 0.539, linetype = "dashed") +
  geom_text(aes(0, 0.539, label = "TARGET 53.9%", hjust = 0.2, vjust = -0.5), size = 2.5) +
  geom_hline(yintercept = 0.85, linetype = "dashed") +
  geom_text(aes(0, 0.85, label = "Herd Immunity Threshold (?)", hjust = 0.1, vjust = -0.5), size = 2.5) +
  scale_size(range = c(.1, 24)) +
  scale_y_continuous(breaks = seq(0, 1, by = 0.05), limits = c(0, 1), labels = percent_format(accuracy = 1)) +
  scale_x_continuous(breaks = seq(0, 1, by = 0.05), limits = c(0, 1), labels = percent_format(accuracy = 1)) +
  labs(title = "COVID-19 VACCINATION LEVELS OUT OF TOTAL POPULATION BY COUNTY",
       subtitle = "(most states based on FULLY vaccinated only; CA, GA, IA, MI & TX based on total doses administered)
Data via Centers for Disease Control, COVID Act Now, state health depts
Graph by Charles Gaba / ACASignups.net",
       x = "2020 Trump Vote %",
       y = "% of Total Population Vaccinated") +
  annotate("text", x = -Inf, y = Inf, hjust = -1.5, vjust = 2.3, label = "EVERY U.S. COUNTY") +
  theme_bw() +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5), aspect.ratio = 0.95) +
  stat_regline_equation(label.x = 0, label.y = 0.035) +
  stat_cor(aes(label = ..rr.label..), label.x = 0, label.y = 0) +
  NULL
vaccination_plot
Comment:
There is a slight negative correlation between the share of votes for Trump and the share of the total population vaccinated. This makes sense, as Trump's policies were very COVID-sceptic and vaccine-sceptic, so the counties that voted for him would exhibit the same tendencies. Among counties with a majority non-Trump vote, 77 sit below the line of best fit. Of these counties, 86% are majority non-white, and the general level of wealth in them is lower: for example, many are in Alabama and Mississippi, whose GDP per capita is $41,000 and $35,000 respectively, at least 33% below the US average of $61,000. This lower level of wealth probably means less access to good-quality education in these counties, which is a key factor in getting vaccinated. Without these counties, the graph would show a much stronger negative correlation between the Trump vote share and the vaccination rate.
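As an aside on the data preparation above: the base-R capping (table_new[table_new$votes > 1, 3] <- 1) depends on column positions. The same capping can be sketched with dplyr::mutate() and pmin(); the dataframe below is a made-up stand-in for table_new, not the real data.

```r
library(dplyr)

# hypothetical stand-in for table_new: shares that can exceed 1 due to data quirks
toy <- tibble(fips  = c("01001", "01003"),
              votes = c(1.07, 0.62),
              pct   = c(0.48, 1.15))

toy <- toy %>%
  mutate(votes = pmin(votes, 1), # cap the Trump vote share at 100%
         pct   = pmin(pct, 1))   # cap the vaccinated share at 100%
```

pmin() is vectorised, so it caps every row in one pass and keeps working if the columns are ever reordered, unlike numeric column indices.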
The Guardian newspaper has an election poll tracker for the upcoming German election. The list of opinion polls since January 2021 can be found on Wikipedia, and your task is to reproduce a graph similar to the one produced by the Guardian.
The following code scrapes the Wikipedia page and imports the tables into a list of dataframes.
library(rvest) # read_html(), html_nodes(), html_table()
url <- "https://en.wikipedia.org/wiki/Opinion_polling_for_the_2021_German_federal_election"
# similar graphs and analyses can be found at
# https://www.theguardian.com/world/2021/jun/21/german-election-poll-tracker-who-will-be-the-next-chancellor
# https://www.economist.com/graphic-detail/who-will-succeed-angela-merkel
# get all tables that exist on the Wikipedia page
tables <- url %>%
  read_html() %>%
  html_nodes(css = "table")
# parse the HTML tables into a list of dataframes called polls
# use purrr::map() to apply html_table() to every table found at the URL
polls <- map(tables, . %>%
               html_table(fill = TRUE) %>%
               janitor::clean_names())
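janitor::clean_names() is doing quiet but useful work in that pipeline: Wikipedia headers such as "Polling firm" or "AfD" become syntactic snake_case column names, matching what we see in the parsed tables below. A minimal illustration on a toy tibble (the values are made up):

```r
library(dplyr)
library(janitor)

raw <- tibble(`Polling firm`   = "Forsa",
              `Fieldwork date` = "21-23 Dec 2020",
              `AfD`            = 11)

names(clean_names(raw)) # "polling_firm" "fieldwork_date" "af_d"
```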
polls
## [[1]]
## # A tibble: 252 x 13
## polling_firm fieldwork_date samplesize abs union spd af_d fdp linke
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Polling firm Fieldwork date Samplesize "Abs~ NA NA NA NA NA
## 2 2021 federal e~ 26 Sep 2021 – "" NA NA NA NA NA
## 3 Wahlkreisprogn~ 22–24 Sep 2021 1,400 "–" 22.5 25.5 11 12 7
## 4 Ipsos 22–23 Sep 2021 2,000 "–" 22 26 11 12 7
## 5 Forschungsgrup~ 22–23 Sep 2021 1,273 "–" 23 25 10 11 6
## 6 Forsa 20–23 Sep 2021 2,002 "26" 22 25 10 12 6
## 7 Allensbach 16–23 Sep 2021 1,554 "–" 25 26 10 10.5 5
## 8 Civey 16–23 Sep 2021 10,012 "–" 23 25 10 12 6
## 9 YouGov 16–22 Sep 2021 2,364 "–" 21 25 12 11 7
## 10 Wahlkreisprogn~ 20–21 Sep 2021 1,801 "–" 21.5 25 11 12.5 6.5
## # ... with 242 more rows, and 4 more variables: grune <dbl>, fw <chr>,
## # others <chr>, lead <chr>
##
## [[2]]
## # A tibble: 226 x 12
## polling_firm fieldwork_date samplesize abs union spd af_d fdp linke
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Polling firm Fieldwork date Samplesize Abs. NA NA NA NA NA
## 2 Forsa 21–23 Dec 2020 1,501 – 36 15 9 6 9
## 3 INSA 18–21 Dec 2020 2,055 – 35 16 11 8 7.5
## 4 Forsa 14–18 Dec 2020 2,505 23 37 15 8 6 8
## 5 Kantar 10–16 Dec 2020 2,401 – 35 17 10 6 8
## 6 INSA 11–14 Dec 2020 2,002 – 36 17 10 7 7.5
## 7 Forsa 7–11 Dec 2020 2,503 23 37 15 8 5 8
## 8 Allensbach 28 Nov–10 Dec~ 1,022 – 37 16.5 9 7 7
## 9 Infratest dimap 7–9 Dec 2020 1,004 – 36 16 9 6 7
## 10 Forschungsgrup~ 7–9 Dec 2020 1,246 25 37 16 10 5 8
## # ... with 216 more rows, and 3 more variables: grune <dbl>, others <chr>,
## # lead <chr>
##
## [[3]]
## # A tibble: 234 x 12
## polling_firm fieldwork_date samplesize abs union spd af_d fdp linke
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Polling firm Fieldwork date Samplesize Abs. NA NA NA NA NA
## 2 INSA 20–23 Dec 2019 2,034 – 28 13 15 10 8
## 3 Forsa 16–20 Dec 2019 2,501 23 28 13 13 8 8
## 4 Emnid 12–18 Dec 2019 1,899 – 27 15 14 9 9
## 5 YouGov 13–17 Dec 2019 1,586 – 27 13 15 7 10
## 6 INSA 13–16 Dec 2019 2,020 – 27 13 15 8.5 9.5
## 7 Forsa 9–13 Dec 2019 2,502 22 28 12 13 8 9
## 8 Forschungsgrup~ 10–12 Dec 2019 1,366 – 27 13 14 8 9
## 9 Infratest dimap 10–11 Dec 2019 1,038 – 27 14 15 8 8
## 10 Emnid 5–11 Dec 2019 1,978 – 28 16 13 9 9
## # ... with 224 more rows, and 3 more variables: grune <dbl>, others <chr>,
## # lead <chr>
##
## [[4]]
## # A tibble: 253 x 12
## polling_firm fieldwork_date samplesize abs union spd af_d fdp linke
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Polling firm Fieldwork date Samplesize Abs. NA NA NA NA NA
## 2 YouGov 21–27 Dec 2018 1,665 – 30 16 14 8 10
## 3 INSA 20–21 Dec 2018 1,029 – 29 14 15 9 9
## 4 Forsa 17–21 Dec 2018 2,504 24 31 14 13 8 8
## 5 Emnid 13–19 Dec 2018 1,889 – 29 15 14 9 9
## 6 INSA 14–17 Dec 2018 2,077 – 29 15 15 9.5 9.5
## 7 Forsa 10–14 Dec 2018 2,507 24 32 15 12 8 8
## 8 Forschungsgrup~ 11–13 Dec 2018 1,268 – 30 15 15 7 9
## 9 Infratest dimap 11–12 Dec 2018 1,048 – 31 15 13 8 8
## 10 Emnid 6–12 Dec 2018 1,997 – 30 15 14 8 8
## # ... with 243 more rows, and 3 more variables: grune <dbl>, others <chr>,
## # lead <chr>
##
## [[5]]
## # A tibble: 68 x 12
## polling_firm fieldwork_date samplesize abs union spd af_d fdp linke
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Polling firm Fieldwork date Samplesize Abs. NA NA NA NA NA
## 2 YouGov 25–27 Dec 2017 1,489 – 33 21 13 9 10
## 3 INSA 21–22 Dec 2017 2,203 – 33 20.5 13 10.5 10
## 4 Forsa 18–22 Dec 2017 2,504 21 34 19 12 8 10
## 5 Emnid 14–20 Dec 2017 1,861 – 33 21 12 8 10
## 6 INSA 15–18 Dec 2017 2,031 – 31 21 14 9 11
## 7 Forsa 11–15 Dec 2017 2,501 21 33 20 12 8 10
## 8 Allensbach 1–14 Dec 2017 1,443 – 34 21 11 10 9
## 9 Infratest dimap 11–13 Dec 2017 1,029 – 32 20 13 9 9
## 10 Emnid 7–13 Dec 2017 1,865 – 32 22 13 8 9
## # ... with 58 more rows, and 3 more variables: grune <dbl>, others <chr>,
## # lead <chr>
##
## [[6]]
## # A tibble: 211 x 12
## polling_firm fieldwork_date samplesize cdu spd af_d fdp linke grune
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Polling firm Fieldwork date Samplesize NA NA NA NA NA NA
## 2 INSA 17–20 Sep 2021 2,054 17 25 11 12 6.5 15
## 3 INSA 10–13 Sep 2021 2,062 16 26 11.5 12.5 6.5 15
## 4 INSA 3–6 Sep 2021 2,052 15.5 26 11 12.5 6.5 15.5
## 5 INSA 27–30 Aug 2021 2,015 15 25 11 13.5 7 16.5
## 6 INSA 20–23 Aug 2021 2,119 18 23 11 13 7 17
## 7 INSA 13–16 Aug 2021 2,080 20 20 11 12.5 6.5 17.5
## 8 INSA 6–9 Aug 2021 2,118 20.5 17.5 11.5 12.5 6.5 17.5
## 9 INSA 30 Jul–2 Aug 2021 2,080 22.5 18 11 13 7 18
## 10 INSA 23–26 Jul 2021 2,007 22 17.5 12 13 6 17.5
## # ... with 201 more rows, and 3 more variables: csu <dbl>, others <chr>,
## # lead <chr>
##
## [[7]]
## # A tibble: 4 x 11
## polling_firm fieldwork_date samplesize cdu spd grune fdp af_d linke
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Polling firm Fieldwork date Samplesize NA NA NA NA NA NA
## 2 Forsa 29 Jan – 1 Feb~ 1,007 33 10 24 10 12 6
## 3 Forsa 8–22 Feb 2018 1,003 35 13 17 11 12 8
## 4 2017 federal e~ 24 Sep 2017 – 34.4 16.4 13.5 12.7 12.2 6.4
## # ... with 2 more variables: others <chr>, lead <chr>
##
## [[8]]
## # A tibble: 24 x 12
## polling_firm fieldwork_date samplesize csu spd af_d fdp grune linke
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Polling firm Fieldwork date Samplesize NA NA NA NA NA NA
## 2 GMS 13–15 Sep 2021 1,002 29 17 11 13 16 5
## 3 GMS 8–13 Sep 2021 1,005 28 18 11 12 17 4
## 4 Infratest dimap 2–6 Sep 2021 1,195 28 18 10 12 16 3
## 5 GMS 1–6 Sep 2021 1,003 29 15 10 13 18 3
## 6 GMS 21–27 Jul 2021 1,003 35 9 9 12 20 4
## 7 INSA 12–19 Jul 2021 1,000 36 11 9 9 19 4
## 8 Infratest dimap 2–6 Jul 2021 1,186 36 9 10 11 18 4
## 9 GMS 16–21 Jun 2021 1,007 34 10 9 11 22 3
## 10 GMS 11–17 May 2021 1,003 32 8 10 11 26 4
## # ... with 14 more rows, and 3 more variables: free_voters <chr>, others <chr>,
## # lead <chr>
##
## [[9]]
## # A tibble: 30 x 11
## polling_firm fieldwork_date samplesize cdu linke spd grune af_d fdp
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Polling firm Fieldwork date Samplesize NA NA NA NA NA NA
## 2 Forsa 29 Jan – 6 Feb 2~ 1,001 21 16 12 26 10 6
## 3 Forsa 12–19 Dec 2019 1,005 21 16 13 24 10 7
## 4 Forsa 21–26 Nov 2019 1,006 20 16 14 24 11 7
## 5 Forsa 22–31 Oct 2019 1,002 22 15 13 24 11 6
## 6 Forsa 17–26 Sep 2019 1,002 22 15 13 24 11 7
## 7 Forsa 20–29 Aug 2019 1,003 22 17 14 24 10 7
## 8 Forsa 17–25 Jul 2019 1,001 22 17 14 24 10 6
## 9 Forsa 17–27 Jun 2019 1,004 21 16 14 24 11 7
## 10 Forsa 20–27 May 2019 1,006 19 16 14 25 12 7
## # ... with 20 more rows, and 2 more variables: others <chr>, lead <chr>
##
## [[10]]
## # A tibble: 6 x 11
## polling_firm fieldwork_date samplesize cdu af_d spd linke fdp grune
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Polling firm Fieldwork date Samplesize NA NA NA NA NA NA
## 2 Infratest dimap 25–28 Aug 2021 1,157 15 18 29 11 9 9
## 3 Forsa 10–15 Dec 2020 1,001 28 17 15 12 4 17
## 4 Forsa 17–20 Dec 2018 1,005 23 20 12 18 7 14
## 5 Forsa 7–9 Nov 2017 1,002 26 20 16 19 8 6
## 6 2017 federal el~ 24 Sep 2017 – 26.7 20.2 17.6 17.2 7.1 5
## # ... with 2 more variables: others <chr>, lead <chr>
##
## [[11]]
## # A tibble: 8 x 11
## polling_firm fieldwork_date samplesize cdu spd grune linke fdp af_d
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Polling firm Fieldwork date Samplesize NA NA NA NA NA NA
## 2 Trend Research 13–16 Sep 2021 703 14 33 18 9 14 7
## 3 Trend Research 30 Aug–2 Sep 2~ 708 15 34 17 10 13 7
## 4 Trend Research 12–18 Aug 2021 702 17 28 19 9 14 7
## 5 Forsa 18 Dec 2019–6 ~ 1,069 21 16 29 11 8 9
## 6 Universität Ha~ 6 Jan–2 Mar 20~ 1,069 20 24 32 8 10 3
## 7 Forsa 27 Dec 2018–3 ~ 1,004 22 17 26 11 10 9
## 8 2017 federal e~ 24 Sep 2017 – 27.2 23.5 13.9 12.2 10.8 7.8
## # ... with 2 more variables: others <chr>, lead <chr>
##
## [[12]]
## # A tibble: 3 x 11
## polling_firm fieldwork_date samplesize cdu spd af_d fdp grune linke
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Polling firm Fieldwork date Samplesize NA NA NA NA NA NA
## 2 Forsa 8–22 Feb 2018 1,035 31 19 10 12 14 10
## 3 2017 federal el~ 24 Sep 2017 – 30.9 23.5 11.9 11.5 9.7 8.1
## # ... with 2 more variables: others <chr>, lead <chr>
##
## [[13]]
## # A tibble: 16 x 11
## polling_firm fieldwork_date samplesize cdu af_d linke spd fdp grune
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Polling firm Fieldwork date Samplesize NA NA NA NA NA NA
## 2 INSA 10–16 Sep 2021 1,000 14 19 11 38 6 9
## 3 Infratest dimap 2–7 Sep 2021 1,180 16 18 11 31 9 8
## 4 Infratest dimap 19–24 Aug 2021 1,153 19 18 12 29 9 7
## 5 INSA 6–12 Aug 2021 1,000 21 18 12 25 8 10
## 6 INSA 16–22 Jul 2021 1,098 26 21 13 17 8 11
## 7 Infratest dimap 8–13 Jul 2021 1,159 30 17 13 17 8 8
## 8 Infratest dimap 12–18 May 2021 1,245 23 16 12 18 7 16
## 9 Forsa 12–15 Jan 2021 1,002 32 15 16 12 5 13
## 10 Infratest dimap 18–21 Nov 2020 1,000 34 15 12 20 4 10
## 11 Forsa 2–10 Jan 2020 1,002 27 20 16 8 6 14
## 12 Forsa 18–23 Sep 2019 1,002 28 21 14 10 6 14
## 13 Forsa 4–11 Jan 2019 1,007 32 20 16 11 5 11
## 14 Forsa 24–28 Jun 2018 1,002 30 21 19 12 5 8
## 15 Forsa 2–11 Jan 2018 1,001 33 18 18 15 5 6
## 16 2017 federal e~ 24 Sep 2017 – 33.1 18.6 17.8 15.1 6.2 4.3
## # ... with 2 more variables: others <chr>, lead <chr>
##
## [[14]]
## # A tibble: 5 x 11
## polling_firm fieldwork_date samplesize cdu spd fdp af_d grune linke
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Polling firm Fieldwork date Samplesize NA NA NA NA NA NA
## 2 Forsa 19–28 May 2020 1,002 40 20 5 7 16 6
## 3 Forsa 1–8 Feb 2019 1,010 33 19 8 9 19 7
## 4 Forsa 8–22 Feb 2018 1,004 36 24 9 8 12 7
## 5 2017 federal el~ 24 Sep 2017 – 34.9 27.4 9.3 9.1 8.7 7
## # ... with 2 more variables: others <chr>, lead <chr>
##
## [[15]]
## # A tibble: 6 x 11
## polling_firm fieldwork_date samplesize cdu spd fdp af_d grune linke
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Polling firm Fieldwork date Samplesize NA NA NA NA NA NA
## 2 Forsa 10–17 May 2021 1,058 24 18 11 8 28 4
## 3 Forsa Mar–Apr 2021 1,090 26 18 11 8 25 6
## 4 Forsa 1–16 Aug 2019 1,505 27 17 10 9 25 6
## 5 Forsa 8–22 Feb 2018 1,015 34 21 11 9 11 8
## 6 2017 federal el~ 24 Sep 2017 – 32.6 26 13.1 9.4 7.6 7.5
## # ... with 2 more variables: others <chr>, lead <chr>
##
## [[16]]
## # A tibble: 7 x 12
## polling_firm fieldwork_date samplesize cdu spd af_d fdp grune linke
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Polling firm Fieldwork date Samplesize NA NA NA NA NA NA
## 2 Infratest dimap 2–7 Sep 2021 1,160 23 30 10 11 13 4
## 3 Infratest dimap 9–13 Jul 2021 1,153 30 22 11 9 17 4
## 4 Forsa 1–8 Feb 2019 1,005 34 16 10 10 18 7
## 5 Infratest dimap 5–7 Mar 2018 1,001 34 23 12 9 12 7
## 6 Infratest dimap 8–12 Dec 2017 1,003 39 24 9 8 12 5
## 7 2017 federal el~ 24 Sep 2017 – 35.9 24.2 11.2 10.4 7.6 6.8
## # ... with 3 more variables: free_voters <chr>, others <chr>, lead <chr>
##
## [[17]]
## # A tibble: 5 x 11
## polling_firm fieldwork_date samplesize af_d cdu linke spd fdp grune
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Polling firm Fieldwork date Samplesize NA NA NA NA NA NA
## 2 INSA 6–13 Sep 2021 1,000 26 18 11 18 11 8
## 3 Infratest dimap 13–18 Aug 2021 1,179 23 21 11 15 12 10
## 4 INSA 2–9 Aug 2021 1,001 25 24 13 10 11 10
## 5 2017 federal el~ 24 Sep 2017 – 27 26.9 16.1 10.5 8.2 4.6
## # ... with 2 more variables: others <chr>, lead <chr>
##
## [[18]]
## # A tibble: 3 x 11
## polling_firm fieldwork_date samplesize cdu af_d linke spd fdp grune
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Polling firm Fieldwork date Samplesize NA NA NA NA NA NA
## 2 INSA 16–23 Aug 2021 1,000 18 22 18 21 9 5
## 3 2017 federal el~ 24 Sep 2017 – 28.8 22.7 16.9 13.2 7.8 4.1
## # ... with 2 more variables: others <chr>, lead <chr>
##
## [[19]]
## # A tibble: 69 x 25
## polling_firm fieldwork_date samplesize spd_union spd_union_2 union_fdp
## <chr> <chr> <chr> <int> <int> <chr>
## 1 Polling firm Fieldwork date Samplesize NA NA ""
## 2 Forschungsgruppe ~ 14–16 Sep 2021 1,406 9 9 "8"
## 3 Forschungsgruppe ~ 7–9 Sep 2021 1,281 9 9 "7"
## 4 Forschungsgruppe ~ 31 Aug–2 Sep 2~ 1,301 12 12 "8"
## 5 Forschungsgruppe ~ 24–26 Aug 2021 1,300 10 10 "6"
## 6 Forschungsgruppe ~ 10–12 Aug 2021 1,252 12 12 "10"
## 7 Forschungsgruppe ~ 27–29 Jul 2021 1,268 13 13 "10"
## 8 Forschungsgruppe ~ 13–15 Jul 2021 1,224 13 13 "11"
## 9 Forschungsgruppe ~ 22–24 Jun 2021 1,271 11 11 "11"
## 10 Civey 8–9 Jun 2021 5,005 6 6 "–"
## # ... with 59 more rows, and 19 more variables: union_fdp_2 <chr>,
## # spd_grune <chr>, spd_grune_2 <chr>, union_grune <int>, union_grune_2 <int>,
## # spd_grune_linke <int>, spd_grune_linke_2 <int>, spd_grune_linke_3 <int>,
## # union_grune_fdp <int>, union_grune_fdp_2 <int>, union_grune_fdp_3 <int>,
## # spd_union_fdp <chr>, spd_union_fdp_2 <chr>, spd_union_fdp_3 <chr>,
## # spd_grune_fdp <chr>, spd_grune_fdp_2 <chr>, spd_grune_fdp_3 <chr>,
## # spdfdp <chr>, spdfdp_2 <chr>
# list of opinion polls
german_election_polls <- polls[[1]] %>% # the first table on the page contains the list of all opinion polls
  slice(2:(n() - 1)) %>% # drop the first row (it repeats the variable names) and the last row (it contains the 2017 results)
  mutate(
    # polls are shown to run from-to, e.g. 9-13 Aug 2021; we keep the last date, 13 Aug here, as the poll date,
    # extracting it as the last 11 characters of that field
    end_date = str_sub(fieldwork_date, -11),
    # end_date is still a string, so we convert it into a date object using lubridate::dmy()
    end_date = dmy(end_date),
    # we also get the month and week number from the date, in case we want to analyse by month, week, etc.
    month = month(end_date),
    week = isoweek(end_date)
  )
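To see the date extraction in isolation, here is the same str_sub()/dmy() logic applied to a single fieldwork string (the value is made up but follows the Wikipedia format):

```r
library(stringr)   # str_sub()
library(lubridate) # dmy(), month(), isoweek()

fieldwork <- "9-13 Aug 2021"             # polls run from-to
end_date  <- dmy(str_sub(fieldwork, -11)) # last 11 characters are "13 Aug 2021"
end_date          # 2021-08-13
month(end_date)   # 8
isoweek(end_date) # 32
```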
library(plotly)
# average all polls that share the same end date, one column per party
german_election_polls_new <- german_election_polls %>%
  arrange(end_date) %>%
  group_by(end_date) %>%
  summarise(CDU_mean = mean(union),
            AFD_mean = mean(af_d),
            SPD_mean = mean(spd),
            FDP_mean = mean(fdp),
            Linke_mean = mean(linke),
            Grune_mean = mean(grune))
# calculate a rolling average over 14 consecutive poll dates
# (zoo::rollmean() works on observations, so the window spans 14 poll dates rather than exactly 14 calendar days)
election_14_rolling_mean <- german_election_polls_new %>% # this is a new object
  mutate(CDU = zoo::rollmean(CDU_mean, k = 14, fill = NA),
         AFD = zoo::rollmean(AFD_mean, k = 14, fill = NA),
         SPD = zoo::rollmean(SPD_mean, k = 14, fill = NA),
         FDP = zoo::rollmean(FDP_mean, k = 14, fill = NA),
         Linke = zoo::rollmean(Linke_mean, k = 14, fill = NA),
         Grune = zoo::rollmean(Grune_mean, k = 14, fill = NA))
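A quick illustration of how zoo::rollmean() behaves: the window is centred by default, so the positions at either end that cannot accommodate a full window receive the fill value.

```r
library(zoo)

x <- c(1, 2, 3, 4, 5)
rollmean(x, k = 3, fill = NA) # NA 2 3 4 NA
```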
election_14_rolling_mean
## # A tibble: 146 x 13
## end_date CDU_mean AFD_mean SPD_mean FDP_mean Linke_mean Grune_mean CDU
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2021-01-04 36.5 10 15.5 6.75 7.75 18 NA
## 2 2021-01-05 36 10 15 6 9 18 NA
## 3 2021-01-06 35 10 14 7 7 21 NA
## 4 2021-01-08 36 8 14 7 8 20 NA
## 5 2021-01-11 36 10 15 7.5 8 18 NA
## 6 2021-01-13 36 10 15 7 8 18 NA
## 7 2021-01-14 37 10 15 5 8 20 35.9
## 8 2021-01-15 35 9 15 6 8 20 35.8
## 9 2021-01-17 35 9 15 7 8 19 35.7
## 10 2021-01-18 35.2 11 14.5 8.75 7.75 17.2 35.7
## # ... with 136 more rows, and 5 more variables: AFD <dbl>, SPD <dbl>,
## # FDP <dbl>, Linke <dbl>, Grune <dbl>
# form a separate dataframe for each party, each with the same column names
CDU_df <- election_14_rolling_mean %>%
select(CDU, end_date) %>%
rename(percentage = CDU)
AFD_df <- election_14_rolling_mean %>%
select(AFD, end_date) %>%
rename(percentage = AFD)
SPD_df <- election_14_rolling_mean %>%
select(SPD, end_date) %>%
rename(percentage = SPD)
FDP_df <- election_14_rolling_mean %>%
select(FDP, end_date) %>%
rename(percentage = FDP)
Linke_df <- election_14_rolling_mean %>%
select(Linke, end_date) %>%
rename(percentage = Linke)
Grune_df <- election_14_rolling_mean %>%
select(Grune, end_date) %>%
rename(percentage = Grune)
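The six near-identical blocks above could also be collapsed into one long dataframe with tidyr::pivot_longer(), which would let ggplot map colour to party instead of layering six dataframes. A sketch on a hypothetical two-date extract (not the real rolling-mean table):

```r
library(dplyr)
library(tidyr)

# hypothetical stand-in for the party columns of election_14_rolling_mean
rolling <- tibble(end_date = as.Date(c("2021-09-01", "2021-09-02")),
                  CDU = c(21, 20.5), SPD = c(25, 25.5), Grune = c(16, 16))

party_long <- rolling %>%
  pivot_longer(-end_date, names_to = "party", values_to = "percentage")
# one row per date/party pair, ready for aes(x = end_date, y = percentage, colour = party)
```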
#Plotting the data
election_polls <- ggplot(CDU_df, aes(x = end_date, y = percentage)) +
geom_point(color = "grey", alpha = 0.3) +
geom_smooth(color = "black",se = FALSE) +
geom_point(data = AFD_df,color = "navyblue",alpha = 0.3) +
geom_smooth(data = AFD_df,color = "navyblue",se = FALSE) +
geom_point(data = SPD_df,color = "red3",alpha = 0.3) +
geom_smooth(data = SPD_df,color = "red3",se = FALSE) +
geom_point(data = FDP_df,color = "yellow2",alpha = 0.3) +
geom_smooth(data = FDP_df,color = "yellow2",se = FALSE) +
geom_point(data = Linke_df,color = "purple3",alpha = 0.3) +
geom_smooth(data = Linke_df,color = "purple3",se = FALSE) +
geom_point(data = Grune_df,color = "green3",alpha = 0.3) +
geom_smooth(data = Grune_df,color = "green3",se = FALSE) +
theme_bw()+
scale_y_continuous(breaks = seq(0,45, by = 5))+
labs(title = "Opinion Polling for the 2021 German Federal Election",
x = "Date",
y = "Predicted votes (%)",
caption = "Source: https://en.wikipedia.org/wiki/Opinion_polling_for_the_2021_German_federal_election") +
theme(plot.caption = element_text(hjust = 1),
plot.title = element_text(hjust=0.5)) +
theme(panel.grid.major.x = element_blank(), panel.grid.minor.x = element_blank()) +
NULL
ggplotly(election_polls)
Comment:
The CDU gained popularity in March 2020 for its policies as the pandemic started; in Spain and Peru, the governing parties likewise gained popularity at the start of the pandemic. People believed the government's decisions about lockdowns and financial support were sound, which may explain the boost. After Armin Laschet was elected chairman of Angela Merkel's Christian Democratic party in January 2021, the CDU progressively lost popularity: he is seen as an unfavourable candidate to replace Merkel and is viewed by political critics as lacking empathy. The Social Democrats have taken the lead in the polls for the first time in 15 years, owing both to the unpopularity of the CDU and to Germans' growing concern about the country's level of indebtedness. The SPD's promises of looser fiscal policy and taxes on the rich and wealthy have attracted a lot of support among the German population.